Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM
Trappen, Tim, Keßler, Robert, Pabel, Roland, Achter, Viktor, Wesner, Stefan
Due to rising demands for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose our solution that serves Large Language Models (LLMs) by integrating vLLM, Slurm and Kubernetes on the supercomputer \textit{RAMSES}. An initial benchmark indicates that the proposed architecture scales efficiently to 100, 500 and 1000 concurrent requests, incurring an end-to-end latency overhead of only approximately 500 ms.
- North America > United States (0.70)
- Asia (0.68)
- Europe > Germany > North Rhine-Westphalia (0.14)
- Information Technology (0.95)
- Education > Educational Setting (0.50)
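The kind of concurrency benchmark described in the abstract can be sketched with asyncio: issue N simultaneous requests and collect per-request end-to-end latencies. The request coroutine below is a stub standing in for a real call to an OpenAI-compatible vLLM endpoint; all names and timings are ours, not from the paper.

```python
import asyncio
import statistics
import time

async def fake_llm_request(prompt: str) -> str:
    # Stub for a real HTTP call to a vLLM OpenAI-compatible endpoint;
    # the short sleep stands in for network + inference time.
    await asyncio.sleep(0.01)
    return f"completion for: {prompt}"

async def run_benchmark(concurrency: int) -> dict:
    """Issue `concurrency` simultaneous requests and summarize latencies."""
    async def timed_call(i: int) -> float:
        start = time.perf_counter()
        await fake_llm_request(f"prompt {i}")
        return time.perf_counter() - start

    latencies = await asyncio.gather(*(timed_call(i) for i in range(concurrency)))
    return {
        "n": concurrency,
        "mean_s": statistics.mean(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],  # 95th percentile
    }

result = asyncio.run(run_benchmark(100))
print(result["n"], round(result["mean_s"], 4))
```

Repeating the run at 100, 500, and 1000 concurrent requests, as in the paper's benchmark, only requires varying the `concurrency` argument.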
ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System
Sun, Yongqian, Pan, Xijie, Xiong, Xiao, Tao, Lei, Wang, Jiaju, Zhang, Shenglin, Yuan, Yuan, Li, Yuqi, Jian, Kunlin
Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and lack of accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches: a failure graph is constructed based on the output of the state classifier, and ClusterRCA then performs a customized random walk on the graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves high accuracy in diagnosing network failure for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
- North America > United States (0.14)
- Europe > United Kingdom (0.04)
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- Asia > China > Tianjin Province > Tianjin (0.04)
- Energy (0.47)
- Telecommunications (0.47)
- Information Technology (0.46)
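The graph-based step can be illustrated with a toy random walk with restart over a weighted failure graph: nodes whose incoming anomaly edges carry high weight accumulate the most visits and are ranked as likely culprits. The graph, weights, and node names below are invented for illustration and are not ClusterRCA's actual construction.

```python
import random
from collections import defaultdict

def random_walk_localize(adj, steps=20000, restart=0.15, seed=42):
    """Rank nodes of a weighted failure graph by visit frequency under a
    random walk with restart; the most-visited node is the suspected culprit.
    `adj` maps node -> list of (neighbor, weight) pairs with weights > 0."""
    rng = random.Random(seed)
    nodes = list(adj)
    visits = defaultdict(int)
    cur = rng.choice(nodes)
    for _ in range(steps):
        visits[cur] += 1
        if rng.random() < restart or not adj[cur]:
            cur = rng.choice(nodes)  # restart at a uniformly random node
            continue
        nbrs, weights = zip(*adj[cur])
        cur = rng.choices(nbrs, weights=weights, k=1)[0]
    return sorted(visits, key=visits.get, reverse=True)

# Toy graph: anomaly-correlation edges point toward "nic-3", the planted culprit.
graph = {
    "nic-1": [("nic-3", 0.9), ("nic-2", 0.1)],
    "nic-2": [("nic-3", 0.9), ("nic-1", 0.1)],
    "nic-3": [("nic-1", 0.05), ("nic-2", 0.05)],
}
ranking = random_walk_localize(graph)
print(ranking[0])
```

With the walk strongly biased toward the planted culprit, `nic-3` dominates the visit counts and is ranked first.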
Improving the Efficiency of a Deep Reinforcement Learning-Based Power Management System for HPC Clusters Using Curriculum Learning
Budiarjo, Thomas, Pradata, Santana Yuda, Santiyuda, Kadek Gemilang, Amrizal, Muhammad Alfian, Pulungan, Reza, Takizawa, Hiroyuki
High energy consumption remains a key challenge in high-performance computing (HPC) systems, which often feature hundreds or thousands of nodes drawing substantial power even in idle or standby modes. Although powering down unused nodes can improve energy efficiency, choosing the wrong time to do so can degrade quality of service by delaying job execution. Machine learning, in particular reinforcement learning (RL), has shown promise in determining optimal times to switch nodes on or off. In this study, we enhance the performance of a deep reinforcement learning (DRL) agent for HPC power management by integrating curriculum learning (CL), a training approach that introduces tasks with gradually increasing difficulty. Using the Batsim-py simulation framework, we compare the proposed CL-based agent to both a baseline DRL method (without CL) and the conventional fixed-time timeout strategy. Experimental results confirm that an easy-to-hard curriculum outperforms other training orders in terms of reducing wasted energy usage. The best agent achieves a 3.73% energy reduction over the baseline DRL method and a 4.66% improvement compared to the best timeout configuration (shutdown every 15 minutes of idle time). In addition, it reduces average job waiting time by 9.24% and maintains a higher job-filling rate, indicating more effective resource utilization. Sensitivity tests across various switch-on durations, power levels, and cluster sizes further reveal the agent's adaptability to changing system parameters without retraining. These findings demonstrate that curriculum learning can significantly improve DRL-based power management in HPC, balancing energy savings, quality of service, and robustness to diverse configurations.
- Asia > Indonesia (0.28)
- Asia > Singapore (0.18)
- North America > United States > New York > New York County > New York City (0.14)
- (2 more...)
- Electrical Industrial Apparatus (0.82)
- Government > Regional Government > North America Government > United States Government (0.46)
- Energy > Oil & Gas > Upstream (0.34)
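For intuition about the fixed-timeout baseline the DRL agents are compared against, the sketch below tallies the energy such a policy wastes: a node idles at full idle power until the timeout fires, and shutting down incurs a fixed boot-overhead cost when the next job arrives. All quantities are hypothetical and not taken from the paper's experiments.

```python
def wasted_idle_energy(idle_gaps_min, timeout_min, idle_power_w, boot_overhead_wh):
    """Energy (Wh) wasted under a fixed-timeout shutdown policy, given a list
    of idle-gap durations in minutes between consecutive jobs on a node."""
    wasted = 0.0
    for gap in idle_gaps_min:
        if gap <= timeout_min:
            # Timeout never fires: the node idles for the whole gap.
            wasted += gap / 60 * idle_power_w
        else:
            # Node idles until the timeout, then pays the boot cost later.
            wasted += timeout_min / 60 * idle_power_w + boot_overhead_wh
    return wasted

# Hypothetical: two idle gaps (10 min and 60 min), 15-minute timeout,
# 100 W idle draw, 5 Wh boot overhead.
w = wasted_idle_energy([10, 60], timeout_min=15, idle_power_w=100, boot_overhead_wh=5)
print(round(w, 2))  # → 46.67 (Wh)
```

A learned policy improves on this baseline exactly when it predicts gap lengths well enough to shut down early on long gaps and stay up on short ones.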
Toward Smart Scheduling in Tapis
Stubbs, Joe, Padhy, Smruti, Cardone, Richard
The Tapis framework provides APIs for automating job execution on remote resources, including HPC clusters and servers running in the cloud. Tapis can simplify the interaction with remote cyberinfrastructure (CI), but the current services require users to specify the exact configuration of a job to run, including the system, queue, node count, and maximum run time, among other attributes. Moreover, the remote resources must be defined and configured in Tapis before a job can be submitted. In this paper, we present our efforts to develop an intelligent job scheduling capability in Tapis, where various attributes about a job configuration can be automatically determined for the user, and computational resources can be dynamically provisioned by Tapis for specific jobs. We develop an overall architecture for such a feature, which suggests a set of core challenges to be solved. Then, we focus on one such specific challenge: predicting queue times for a job on different HPC systems and queues, and we present two sets of results based on machine learning methods. Our first set of results casts the problem as a regression, which can be used to select the best system from a list of existing options. Our second set of results frames the problem as a classification, allowing us to compare the use of an existing system with a dynamically provisioned resource.
- North America > United States > Texas > Travis County > Austin (0.15)
- North America > United States > Texas > Shelby County > Center (0.05)
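The two framings in the abstract reduce to a simple decision rule once queue-time predictions are in hand: pick the existing system with the lowest predicted wait, unless even that wait exceeds the time to dynamically provision a fresh resource. The sketch below is our own illustration of that selection step, with hypothetical system names and times; it is not Tapis's implementation.

```python
def choose_resource(predicted_queue_s: dict, provision_time_s: float):
    """Select where to run a job: ("existing", system) if some queue's
    predicted wait beats dynamic provisioning, else ("provision", None)."""
    best_system = min(predicted_queue_s, key=predicted_queue_s.get)
    if predicted_queue_s[best_system] <= provision_time_s:
        return ("existing", best_system)
    return ("provision", None)

# Hypothetical predicted queue times (seconds) from a trained regressor.
short_waits = {"system-a": 300.0, "system-b": 1200.0}
long_waits = {"system-a": 900.0, "system-b": 1200.0}
print(choose_resource(short_waits, provision_time_s=600.0))
print(choose_resource(long_waits, provision_time_s=600.0))
```

The regression framing supplies `predicted_queue_s`; the classification framing collapses the same comparison into a single provision/no-provision label.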
I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey
Lewis, Noah, Bez, Jean Luca, Byna, Suren
Because of the increased popularity of Machine Learning (ML) workloads, there is a rising demand for I/O systems that can effectively accommodate their distinct I/O access patterns. Write operation bursts commonly dominate traditional workloads; however, ML workloads are usually read-intensive and use many small files [99]. Due to the absence of a well-established consensus on the preferred I/O stack for ML workloads, numerous developers resort to crafting their own ad-hoc algorithms and storage systems to cater to the specific requirements of their applications [50]. This can result in sub-optimal application performance due to the under-utilization of the storage system, prompting the necessity for novel I/O optimization methods tailored to the demands of ML workloads. In Figure 1, we show the evolving I/O stack used for running ML workloads (on the right side) in comparison with the traditional HPC I/O stack (on the left side). The traditional HPC I/O stack has been developed to support massive parallelism.
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- North America > United States > Washington > King County > Renton (0.04)
- (15 more...)
- Energy (0.93)
- Information Technology > Services (0.67)
- Government > Regional Government > North America Government > United States Government (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
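The read-intensive pattern the survey describes comes largely from training loops that re-read the entire dataset in a freshly shuffled order every epoch. A minimal sketch of that access pattern, with invented sample counts, contrasts it with the write-burst pattern of traditional HPC checkpointing:

```python
import random

def epoch_read_order(num_samples: int, epochs: int, seed: int = 0):
    """Sketch of an ML dataloader's I/O pattern: every epoch re-reads the
    whole dataset in a newly shuffled order, producing many small random
    reads rather than large sequential write bursts."""
    rng = random.Random(seed)
    order = []
    for _ in range(epochs):
        indices = list(range(num_samples))
        rng.shuffle(indices)  # random access order, different each epoch
        order.extend(indices)
    return order

reads = epoch_read_order(num_samples=4, epochs=2)
print(len(reads))  # → 8: every sample read once per epoch
```

Each index here would correspond to opening and reading one small sample file, which is exactly the workload shape that strains POSIX-era HPC file systems.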
KPIs-Based Clustering and Visualization of HPC jobs: a Feature Reduction Approach
Halawa, Mohamed Soliman, Díaz-Redondo, Rebeca P., Fernández-Vilas, Ana
High-Performance Computing (HPC) systems need to be constantly monitored to ensure their stability. The monitoring systems collect a tremendous amount of data about different parameters or Key Performance Indicators (KPIs), such as resource usage, IO waiting time, etc. A proper analysis of this data, usually stored as time series, can provide insight in choosing the right management strategies as well as the early detection of issues. In this paper, we introduce a methodology to cluster HPC jobs according to their KPI indicators. Our approach reduces the inherent high dimensionality of the collected data by applying two techniques to the time series: literature-based and variance-based feature extraction. We also define a procedure to visualize the obtained clusters by combining the two previous approaches and Principal Component Analysis (PCA). Finally, we have validated our contributions on a real data set, concluding that the KPIs related to CPU usage provide the best cohesion and separation for clustering analysis and that our visualization methodology yields good results.
- Europe (0.28)
- Africa > Middle East > Egypt (0.14)
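The pipeline sketched in the abstract, reducing each job's high-dimensional KPI features and then projecting onto two principal components for plotting, can be illustrated as below. This is a generic variance-based selection plus SVD-based PCA, assuming numpy; the paper's actual literature-based feature extraction involves additional steps not shown here.

```python
import numpy as np

def variance_based_features(X: np.ndarray, top_k: int) -> np.ndarray:
    """Keep only the top_k columns (features) with the highest variance,
    a simple form of variance-based feature reduction."""
    keep = np.argsort(X.var(axis=0))[::-1][:top_k]
    return X[:, keep]

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project rows (jobs) onto the first two principal components
    for a 2-D cluster visualization."""
    Xc = X - X.mean(axis=0)          # center the data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T             # coordinates in the top-2 PC basis

# Hypothetical data: 10 jobs described by 6 KPI-derived features.
rng = np.random.default_rng(0)
jobs = rng.normal(size=(10, 6))
reduced = variance_based_features(jobs, top_k=3)
coords = pca_2d(reduced)             # (10, 2) points ready for a scatter plot
```

The resulting 2-D coordinates are what one would feed to a scatter plot colored by cluster label.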
Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers
Halawa, Mohamed S., Díaz-Redondo, Rebeca P., Fernández-Vilas, Ana
Performance analysis is an essential task in High-Performance Computing (HPC) systems and it is applied for different purposes such as anomaly detection, optimal resource allocation, and budget planning. HPC monitoring tasks generate a huge number of Key Performance Indicators (KPIs) to supervise the status of the jobs running in these systems. KPIs give data about CPU usage, memory usage, network (interface) traffic, or other sensors that monitor the hardware. Analyzing this data, it is possible to obtain insightful information about running jobs, such as their characteristics, performance, and failures. The main contribution of this paper is to identify which metrics (KPIs) are the most appropriate to classify different types of jobs according to their behavior in the HPC system. With this aim, we have applied different clustering techniques (partition and hierarchical clustering algorithms) using a real dataset from the Galician Computation Center (CESGA). We have concluded that (i) the metrics (KPIs) related to network (interface) traffic monitoring provide the best cohesion and separation to cluster HPC jobs, and (ii) hierarchical clustering algorithms are the most suitable for this task. Our approach was validated using a different real dataset from the same HPC center.
- Europe > Spain (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
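The hierarchical clustering the paper favors can be illustrated with a naive single-linkage agglomerative sketch: start with every job as its own cluster and repeatedly merge the two closest clusters until k remain. The toy 2-D points below stand in for KPI feature vectors; this is our own illustration, not the paper's implementation.

```python
def single_linkage(points, k):
    """Naive agglomerative (single-linkage) clustering down to k clusters.
    Cluster distance = minimum pairwise Euclidean distance between members."""
    clusters = [[p] for p in points]

    def dist(a, b):
        return min(sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
                   for p in a for q in b)

    while len(clusters) > k:
        # Find and merge the two closest clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

# Toy "jobs": two tight groups of KPI feature vectors.
jobs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters = single_linkage(jobs, k=2)
print([len(c) for c in clusters])
```

Production analyses would use an optimized implementation (e.g. scipy's `linkage`), since this O(n^3)-ish version is only for exposition.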
Benchmarking Performance of Deep Learning Model for Material Segmentation on Two HPC Systems
Williams, Warren R., Glandon, S. Ross, Morris, Luke L., Cheng, Jing-Ru C.
Performance benchmarking of HPC systems is an ongoing effort that seeks to provide information that will allow for increased performance and improve the job schedulers that manage these systems. We develop a benchmarking tool that utilizes machine learning models and gathers performance data on GPU-accelerated nodes while they perform material segmentation analysis. The benchmark uses an ML model that has been converted from Caffe to PyTorch using the MMdnn toolkit and the MINC-2500 dataset. Performance data is gathered on two ERDC DSRC systems, Onyx and Vulcanite. The data reveals that while Vulcanite achieves faster model times in a large number of benchmarks, it is also more subject to environmental factors that can cause performance slower than Onyx; in contrast, the model times from Onyx are consistent across benchmarks.
- North America > United States > Mississippi > Warren County > Vicksburg (0.04)
- North America > United States > Florida > Alachua County > Gainesville (0.04)
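Collecting per-batch model times of the kind the benchmark reports can be sketched as a small timing harness that discards warmup iterations, which on GPU nodes absorb one-off initialization costs. The callable below is a stub; a real run would wrap a PyTorch forward pass. This harness is our illustration, not the paper's tool.

```python
import time

def time_inference(model_fn, batches, warmup=2):
    """Wall-clock time per batch for a model callable, skipping the first
    `warmup` iterations (which typically include JIT/driver initialization)."""
    times = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        model_fn(batch)
        elapsed = time.perf_counter() - start
        if i >= warmup:
            times.append(elapsed)
    return times

def dummy_model(batch):
    return sum(batch)  # stand-in for a segmentation model's forward pass

timings = time_inference(dummy_model, [[1, 2, 3]] * 5, warmup=2)
print(len(timings))  # → 3 timed batches after 2 warmup iterations
```

Comparing the spread of these per-batch times across systems is what surfaces the environmental variability observed on Vulcanite versus the consistency on Onyx.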
Online Job Failure Prediction in an HPC System
Antici, Francesco, Borghesi, Andrea, Kiziltan, Zeynep
Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily rising, representing a critical issue given the ongoing environmental and energy crisis. Therefore, developing strategies to optimize HPC system management is of paramount importance, both to guarantee top-tier performance and to improve energy efficiency. One strategy is to act at the workload level and highlight the jobs that are most likely to fail, prior to their execution on the system. Jobs failing during their execution unnecessarily occupy resources which could delay other jobs, adversely affecting the system performance and energy consumption. In this paper, we study job failure prediction at submit-time using classical machine learning algorithms. Our novelty lies in (i) the combination of these algorithms with Natural Language Processing (NLP) tools to represent jobs and (ii) the design of the approach to work in an online fashion in a real system. The study is based on a dataset extracted from a production machine hosted at the HPC centre CINECA in Italy. Experimental results show that our approach is promising.
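One common way to turn submit-time job text (script contents, resource flags) into fixed-size features for a classifier is feature hashing. The abstract does not specify the paper's exact NLP encoding, so the sketch below is an illustrative stand-in, not their method.

```python
import hashlib

def hashed_bow(text: str, dim: int = 32):
    """Fixed-size bag-of-words vector via feature hashing: each token is
    hashed into one of `dim` buckets, giving a representation that works
    online (no vocabulary needs to be fixed in advance)."""
    vec = [0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec

# Hypothetical submit-time record flattened to text.
features = hashed_bow("sbatch --nodes=4 --time=02:00:00 python train.py")
print(sum(features))  # → 5, one count per token
```

Because the feature space never grows, the same function can encode every newly submitted job in an online deployment and feed it to any classical classifier.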
Development of Authenticated Clients and Applications for ICICLE CI Services -- Final Report for the REHS Program, June-August, 2022
Samar, Sahil, Chen, Mia, Karpinski, Jack, Ray, Michael, Sarin, Archita, Garcia, Christian, Lange, Matthew, Stubbs, Joe, Thomas, Mary
The Artificial Intelligence (AI) institute for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE) is funded by the NSF to build the next generation of Cyberinfrastructure to render AI more accessible to everyone and drive its further democratization in the larger society. We describe our efforts to develop Jupyter Notebooks and Python command line clients that would access these ICICLE resources and services using ICICLE authentication mechanisms. To connect our clients, we used Tapis, which is a framework that supports computational research to enable scientists to access, utilize, and manage multi-institution resources and services. We used Neo4j to organize data into a knowledge graph (KG). We then hosted the KG on a Tapis Pod, which offers persistent data storage with a template made specifically for Neo4j KGs. In order to demonstrate the capabilities of our software, we developed several clients: Jupyter notebooks authentication, Neural Networks (NN) notebook, and command line applications that provide a convenient frontend to the Tapis API. In addition, we developed a data processing notebook that can manipulate KGs on the Tapis servers, including creation of a KG, data upload, and modification. In this report we present the software architecture, design and approach, the success of our client software, and future work.
- North America > United States > California > San Diego County > San Diego (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- (4 more...)
- Information Technology > Security & Privacy (0.73)
- Education > Educational Setting (0.47)